Analysis of evolution through competitive selection 
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Recent studies of in vitro evolution of DNA via protein binding indicate that the evolution behavior 
is qualitatively different in different parameter regimes. 1 here present a general theory that is valid 
for a wide range of parameters, and which reproduces and extends previous results. Specifically, 
the mean-field theory of a general translation-invariant model can be reduced to the basic diffusion 
equation with a dynamic boundary condition. The simple analytical form yields both quantitatively 
accurate predictions and valuable insight into the principles involved. In particular, 1 introduce a 
cutoff criterion for finite populations that illustrates both of these qualities. 



I. INTRODUCTION 

In modern biotechnology, evolutionary techniques have 
found many new applications; in particular, in vitro evo- 
lution has been widely used to evolve DNA [1] , RNA [2] 
and proteins [3]. In order to improve the efficiency of 
such evolution, a quantitative understanding of the pro- 
cess would be helpful. Key questions include how the 
rate of evolution and the equilibrium genotype distri- 
bution depend on the main parameters of the process, 
such as the mutation rate, population size and selection 
strength. Additionally, in vitro evolution is of particular 
interest to theorists: Unlike most biological systems, the 
physical processes involved in in vitro evolution can be 
fully characterized [4] , thus such evolution allows reliable 
comparison between theory and experiment. 

A common assumption in the traditional study of evo- 
lution is that each genotype is associated with a fixed 
fitness, and the reproduction rate for an individual is 
given by the individual's fitness relative to the average 
fitness of the population [e.g. [5-8]). For asexual mod- 
els of this type, the dynamics are fairly easily solved for 
very small populations or for infinite populations. How- 
ever, even with very simple assumptions regarding the 
fitness landscape, the evolution dynamics of a large but 
finite population far from equilibrium turned out to be a 
difficult problem, as the evolution rate diverges for large 
populations [9] ; only recently have general analytical re- 
sults emerged [10]. 

An alternative mode of evolution; evolution through 
competitive selection, or just "competitive evolution" ; 
has recently been used to model in vitro molecular evo- 
lution [4, 11, 12]. In competitive evolution, the selection 
process is separate from the reproduction/amplification 
process: the entire population is amplified by a given fac- 
tor K, and during selection, only the "best" 1/K frac- 
tion of the population is kept. In molecular evolution, 
the property selected for is typically how well a molecule 
binds to some target, and the selection is then accom- 
plished by keeping only the molecules that are bound by 
the target. A model experiment of this type of evolu- 
tion, specifically in vitro evolution of DNA via protein 
binding, was recently performed by Dubertret et al. [13]. 



Although in vitro evolution is the most obvious applica- 
tion, competitive evolution may be an appropriate model 
for many types of asexual^ breeding; indeed, any system 
in which fixed relative reproduction rates are not the pri- 
mary evolutionary force is a potential candidate. 

Unlike fixed-fitness evolution, competitive evolution 
has been shown to have a well-behaved mean-field the- 
ory which allows relatively simple, accurate predictions 
of the evolution dynamics [4, 11]. However, these anal- 
yses are valid only for limited parameter regions, either 
for high mutation rate and weak selection [11] or for low 
mutation rate and strong selection [4]. In this paper, I 
introduce a translationally invariant^ model of compet- 
itive evolution. Using appropriate transformations, this 
model can be reduced to the basic diffusion equation with 
a dynamic boundary condition, which gives a universal 
set of equations. This formulation is valid for a broad 
parameter range, and its simplicity makes it a powerful 
tool both for quantitative predictions and for qualitative 
understanding of the evolution process. 

The paper is organized in the following way: In sec- 
tion II I introduce the translationally invariant model 
of competitive evolution. Section III describes how this 
model can be reduced to the basic diffusion equation with 
a dynamic boundary condition and how this approach 
gives improved versions of results from earlier models. 
Section IV uses this formulation to find the basic pop- 
ulation dynamics, and sections V-VII deal with correc- 
tions to the various assumptions I make, namely for finite 
population size, finite length, and finite temperature, re- 
spectively. 



II. THE MODEL 

While competitive evolution can occur in many differ- 
ent contexts, I use in vitro evolution of DNA via protein 
binding as an example. (I will use the corresponding ter- 



^ Competitive evolution with recombination has recently been 
studied in [12]. 

^ As described below, this correponds to an infinite-length model. 
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minology throughout the paper. The following is a rough 
list of alternate terminology: "base" = "locus" / "allele" , 
"binding energy" = "phenotype" / "character" , "DNA se- 
quence" = "genotype" / "chromosome" , and "molecule" 
= "chromosome" / "population member" .) 



A. The evolution process 

An initial population of N DNA molecules is subjected 
to consecutive cycles (iterations) of evolution, where each 
evolution cycle consists of amplification and mutation fol- 
lowed by selection through protein binding. First, the 
population is amplified by a factor K. Each base of 
each DNA molecule is then mutated with some proba- 
bility which may depend on the position and type of the 
base only (i.e., mutations are independent). Finally, the 
strongest protein binders are selected: Each molecule is 
kept (bomid) with probability P[E{S)] = ^_^^,a[E(s)-^ ] ; 
where E{S) is the binding energy of DNA sequence S, 
13 = -j^^, and the chemical potential of unbound pro- 
tein, /i, is tuned such that the expected number of se- 
lected molecules is N: {P[E{S)]) ^<P = 1/K. 

During amplification, all original molecules are kept, 
and {K — 1)N new molecules are added, each an exact 
copy of a randomly chosen member of the original pop- 
ulation. This approach reduces statistical noise in the 
limit of weak amplification/selection K ^ 1; for K ^ 1 
this detail is irrelevant. 

Note that low E indicates strong binding, and the pop- 
ulation is improving if the evolution rate is negative. This 
is the same sign convention as used in [4, 11], but opposite 
that in [12]. For purposes of describing different evolu- 
tion rates ( "slower" / "faster" ) , I assume that the average 
phenotype is improving at all times, i.e., the evolution 
rate is negative. 



B. Analytical model 

A crucial factor in determining the dynamics of an evo- 
lution process is the fitness landscape, or, for competitive 
evolution, the binding energy landscape. In order to ob- 
tain analytical results, a simple landscape is required, 
and a common assumption is that the binding energy 
of a DNA sequence is given as the sum of contributions 
from the individual bases [11, 14, 15]. Indeed, experi- 
ments have show this to be a very good approximation 
for the Mni-repressor system [16]. However, one can by- 
pass this question by ignoring the genotype completely: 
instead, I describe a DNA molecule only by its binding 
energy, which can take any real value. Additionally, I 
consider an infinite- length model, i.e. I assume that the 



mutation rates arc translationally invariant"^ (but other- 
wise general^) the chance of changing the binding en- 
ergy by some amount e during the mutation step is inde- 
pendent of the initial binding energy of the molecule (or, 
in general, of the genotype of the molecule). As shown 
in section VI, the resulting formulation can yield highly 
accurate results also for finite systems. 

The natural energy scale of the selection process is 
given by the temperature through /3. For evolution via 
protein binding, T is the actual temperature, while for 
other types of competitive evolution, T is a general pa- 
rameter describing the accuracy with which the best phe- 
notypes are selected. Note that the binding energy E is 
assumed to be a deterministic function of the genotype 
S; if the phenotype has a random component, e.g. due to 
environmental effects, then this should be incorporated 
in T rather than in E. 

As discussed elsewhere [11, 17, 18], replacing a smooth 
selection function by a step function P{E) = 0(/x — E) 
will in many cases give accurate results; this corresponds 
to the zero temperature limit T = 0. I use this approx- 
imation in sections III- VI; in section VII I discuss the 
effects of finite temperature. 



C. Simulations 

While the general formulation given above is conve- 
nient for analytical treatment, it is difficult to simulate 
efficiently. For an infinite length system, I discrctize the 
binding energy; E = meo, where eo is given in units of 
ksT, or, for T = 0, is arbitrarily set to 1. To simplify 
further, I restrict mutations to change the binding en- 
ergy by either +eo (deleterious) or — eq (beneficial). The 
number of beneficial mutations received by one sequence 
in an iteration is Poisson distributed with average p_, 
and similarly the average number of deleterious muta- 
tions is p-i- . The model is thus described by five param- 
eters: Population size N, temperature T, amplification 
K, and mutation rates p- and p+. For the simulation 
results shown in Figs. 2 and 7, p- = 0.27, p+ = 0.89, 
K = 1.2 and T = 0. 

For simulations of finite systems, I use the two-state 
model [14] also used in Ref. [11]: At each position, one 
of the A = 4 possible bases (A, C, G and T) is a match, 
while the other three are equally unfavourable and con- 
tribute eo to the binding energy. The binding energy of a 



^ This assumption is satisfied in the limit of an infinitely long DNA 
molecule for which the binding energy is the sum of contribu- 
tions from individual bases, where each base has an infinitesimal 
chance of being mutated in a given iteration, and the initial pop- 
ulation consists of molecules that are typical for the given binding 
energies. 

* While infinite-length models have been studies before [5, 11, 21, 
23] , these are typically the translationally invariant limits of spe- 
cific finite models. 
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DNA sequence is then given by the number of mismatches 
r: E(r) = reo- The mutation rates between the differ- 
ent bases are assumed to all be equal, thus the chance 
of correcting a mismatch is one third of the chance i^o of 
introducing a mismatch. To further facilitate comparison 
with Ref. [11], I also use the same length L = 170 and 
mutation rate vq = 0.01 unless specified, thus the average 
number of mutations per DNA molecule per iteration is 
i^oL = 1.7. 

By keeping track only of the number of molecules that 
have a given discrete binding energy, it is possible to per- 
form simiilations of very large populations (up to 10^*^ 
or more) as well as direct mean- field simulations, limited 
only by the numerical precision of the computer. In order 
to efficiently simulate very large populations, I use some 
approximations when calculating the statistics: The rel- 
evant distribution is the binomial distribution B{n,p), 
giving the number of molecules that has some property, 
where the total number of molecules is n and the proba- 
bility of having the property is p < i. If n < 2000 or the 
average value np is less than 100, 1 use the exact binomial 
distribution, otherwise I use a Gaussian distribution with 
the appropriate average and variance, and for np > 10^*^ 
I ignore fluctuations completely. Furthermore, I cut off 
the Possion distributions of the mutations (beneficial and 
deleterious separately) at probabilities e~^°°. 



III. A SIMPLE FORMULATION 

For sufficiently large populations, a mean-field ap- 
proach is valid [4, 11], i.e., we can use determinis- 
tic equations for the population density nt{E), where 
nt{E')dE' is the fraction of the population that has 
binding energy < i? at time t. While the actual evolution 
proceeds in discrete time steps (iterations) , the following 
continuum differential equation is a good approximation 
(see appendix A for details): 



dtnt{E) 



f 

O — ( 



p{t)\nt(E~t)-nt{E)\dt^knt{E) (1) 



with the boundary condition 



nt{ix{t)) = 0. 



(2) 



The integral term in Eq. (1) describes the mutation pro- 
cess, where y»(e) is the rate of mutations that change the 
binding energy by e, while the second term gives expo- 
nential growth and corresponds to the amplification pro- 
cess, with k = \m{K). The boundary condition enforces 
perfect selection by removing all molecules with binding 
energy above the chemical potential /x(t), which is deter- 
mined by the normalization condition for the population, 

r«n,(£;)d£ = l. 

Assume there exists a mutation that can improve the 
binding energy, i.e., that p(t) > for some e < 0. It is 



then^ easily shown that, for any fc > 0, there is a solution 
n°{E) (x{E- cot)e«°(-®-^°*\ (3) 
with ao and cq given by the conditions 

/oo 
(e-"o^ - l)p{e)de + k (4) 
-oo 



and 



Co 



ee-°'°'p{e)de. 



(5) 



For a general solution with the same exponential factor, 
nt{E) = nt(£;)e"«(^-^o*), Eq. (1) simplifies to 

/OO 
e-''°'p{e)[nt{E-e)-ntmde (6) 
-OO 

« -CodEnt{E)+-fdlnt{E), (7) 

where 7 = / ^e~"°'^p(e)de. Changing coordinates to 
E = E - Cot yields 

dtht{E)~ldlfit{E), nt(/~i(t)) -0, (8) 

where /i(t) = ^{t) — Cot. The only nontrivial part left is 
the normalization condition 



/ ^ 

J —OO 



nt{E)e°"'^dE = 1. 



(9) 



Equation (6) describes the mean field of a random walk 
with a mutation/transition rate that has been exponen- 
tially rescaled, p{e) = e~°'°'^p{e), and Cq and 7 are the cor- 
responding drift and diffusion coefficients, respectively. If 
fo is the characteristic energy scale for the mutations, 
then the condition aoeo 1 (Q!oeo 3> 1) specifics the 
regime of high (low) mutation rate/weak (strong) am- 
plification for which the results in [11] ([4]) apply (see 
section III A for details). 

The only parameters left in Eqs. (8~9) are ao and 7. 
Appropriate rescaling of energy and time yields a uni- 
versal set of equations, which immediately gives a uni- 
versal shape for the population distribution, as shown 
in section IV. Additionally, the simplicity of the new 
formulation — the basic diffusion equation with a dynamic 
boundary condition — allows us to apply all our knowl- 
edge and experience about the diffusion equation to this 
evolution process. The next several sections show how, 
through simple arguments, we can use this formulation 
to find accurate corrections to the various approxima- 
tions/assumptions we have made. Put together, this al- 
lows a thorough understanding of competitive evolution 
as well as a very accurate quantitative description that 
is valid for a broad parameter region. 



Otherwise, there is a nontrivial solution only for k < J p(e)de. 



A. Recovering old results 

While Eqs. (4 5) for ao and cq can not in general b( 
solved analytically, they arc nianagcblc for certain mod- 
els. For the discrete, infinite-length (cq = 1) model de- 
scribed in section II C, the mutation dynamics can b« 
fully described by the diffusion coefficient D and the drifl 
speed V [11]: 



pie) = {D+l)S{e-l) + {D-l)S{e+l). 
These are then related to cq, ckq and 7 through 

D = 7cosh(ao) + ^ sinh(ao) 

V = Co cosh(ao) + 27sinh(Q:o) 

k = 27[cosh(ao) — 1] + co[sinh(Q:o) — Q!o]. 



(10; 



(11; 
(12; 
(13; 



For sufficiently small ao (for which ao ~ y/k/D), cq can 
be expressed as a power series in k/D and v/D: 



vk 



co = v- 2VkD + — 



(14) 



where the two first terms give the solution found in [11], 
with k = - — —■ 



On the other hand, in the limit of low mutation rates 
and strong selection, it is more convenient to write the 
mutation rate density as p{e) = p+5(e— 1) + p-6{e + l). 
For ao » 1 the exponential rescaling p(e) = e~°'°^p{e) 
makes p+ irrelevant, and we find 



Co 



-k 



ao - 1 



In 



1 



(15) 



which is a more accurate version of Eq. (6) in [4], with 
k = ln(2) and p- = These results confirm qual- 

itative differences between different parameter regimes: 
For aoeo ^ 1, the evolution rate depends strongly on 
the mutation rate, while for aoCo ^ 1) the dependence is 
only logarithmic. 



Useful tools 



The following will be used later: 

• Equation (3) describes a population that extends 
to arbitrarily low energies. However, if we impose 
a second boundary condition nt[iJ,{t) — j] = and 
require UtiE) > for /x(t) - f < E < fi{t), then 
there is a bounded solution nt{E) oc sin(fe[/i(f) — 
jTjj'jQabliJ.W-E] ^Yig,t movcs at constant speed Cb, 
where 



Cfc ~ Co 



75^ 

ao 



(16) 



In the limit b 
tion. 



we recover the unbounded solu- 




FIG. 1: (a) The universal shape fo{x) of the propagating 
pulse, (b) The rescaled formulation fo{x) = f(x)e~^ for 
which the mean-field dynamics are purely diffusive. 



As the function ni obeys a simple diffusion equa- 
tion, it will be very smooth, except possibly at 
very early times. The population density nt{E) = 
nt{E)e^ contains a factor that varies exponentially 
with E, thus a linear expansion of ht around the 
boundary will give a good estimate of the popula- 
tion size: 



Popsize(nt) 



dEME)\E-. 



Hi)- 



(17) 



The normalization condition (9) then relates the 
posistion of the boundary to the slope of nt{E) at 
the boundary: 



-ao/I(t) 



(18) 



IV. THE PROPAGATING PULSE 



As argued above, fit will usually be very smooth on the 
characteristic energy scale 1/ao, thus a linear expansion 
of fit around the boundary will give a good description 
of the population at time t. Applying Eq. (18), we find 



nt{E) w ao/o(ao[-E - At(0])- 



where 



a; < 
a; > 



(19) 



(20) 



is the universal pulse shape (Fig. 1). The dynamics of 

the system can now be described as the motion /i(f) of 
a pulse of almost constant shape. In the limit of large 
t (and mean field), the pulse propagates at the constant 
speed Cq. 



A. Finite time 

A population that is located in a small energy range 
will initially propagate slower than the maximum speed 
Co. As described above, the pulse shape is almost con- 
stant except for at very early times, thus the speed dif- 
ference is equivalent to the motion of the boundary in 
Eq. (8). This is easily modeled, to the lowest order cor- 
rection: For a fixed boundary, Eq. (8) has a straightfor- 
ward solution, with a peak of height ~ 7 a distance ~ \/t 



Deviation vs. time Deviation vs. logarithmic time 





FIG. 2: Finite time corrections to boundary position; mean 
field and finite population size simulations compared to theory 
prediction (plus arbitrary constant). Population sizes (left to 
right): N = 2"-\ 2^" , 2^'''\ 2^°\ ... , 2'"'°. 



from the boundary, for which the slope at the bound- 
ary is [Specifically, nt{E) = j-^g{-^), where 

g{x) = —xe^^.] The normalization condition given by 
Eq. (18) now gives the correct location of the boundary, 
jl{t) ^ + Mo I and the speed reduction is simply the 

time derivative of this. The logaritmic dependence on 
time indicates that the error caused by intially fixing the 
boundary is negligible for large t {§i\/li > 
but for small t this error will be significant. 

Agreement with simulations is excellent (Fig. 2); for 
finite populations, the speed settles down at cat once the 
pulse width ~ ^Jyt reaches A^r (see section V below). 



FIG. 3: Alternate acceptable solutions to a continuous deriva- 
tive boundary condition at a; = for discrete dynamics. 
aoeo — 1. 

The mutation rates p{e) must apply equally to all the 
DNA in the population and must be zero unless e is an 
integer times eo- However, if the minimum change in 
binding energy (due to mutations) is eo, then there is no 
way to enforce the boundary condition 5£./i((i?)|^^-jjj — 

a^e~"o^, which follows from Eq. (21), on any finer 
scale^, and options 1 and 2 in Fig. 3 are a priori equally 
valid solutions. More precisely, these are the two extreme 
possibilities for average slope, and the real solution will 
be somewhere in between: 



nt{E) = ht{E) - ht{E - eo) « ^ j , (22) 



B. Discrete binding energies 

While deriving the diffusion Eq. (8), we assumed that 
the DNA could have any binding energy. However, many 
models, as well as my simulations, only consider discrete 
values isQ of the binding energy, corresponding to i mis- 
matches from the WT sequence. Equation (19) clearly 
cannot be the correct pulse shape for the discrete case, 
as the relevant normalization condition rif (ieo) = 1 is 
satisfied only for discrete values of n{t). 

A more convenient approach for the discrete case is to 
consider the cumulative population distribution ht [E) ~ 
J^^nt{E')dE'. From Eq. (1) it follows that ht{E) also 
obeys Eq. (1), but with a different boundary condition 
dEn't{E)\E=fi{t) — and the simple normalization con- 
dition h{fi{t)) = 1. We can perform the same transfor- 
mations as before [ht{E) = ht{E)e°'°'^^-''°*'> etc.] and 
replace the boundary condition with a ceiling: 

dtht{E)^^dlht{E), ht{E)<e'^°^ yS, (21) 

which no longer requires a separate equation for the 
chemical potential. Once we have the solution for ht{E), 
and thus ht{E), we can find the correct population dis- 
tribution given any discrete set Ei of binding energies 
(we assume that the energies are ordered, Ei < E'i+i, for 
convenience): nt{Ei) = ht{Ei) - ht{Ei^i). 



( {I + dx)ey'= - {I + d[x - l])ey^''-^^ x<0 
fy{x) = < l-{l + d[x- l])ey'= < x < 1 

[ a; > 1 

(23) 

where 

-y<d<-{l-e-y). (24) 

Note that /i(t) is here the binding energy at which ht{E) 
reaches 1, not the value at which nt{E) reaches (they 
are identical for the continuous case, but for the discrete 
case, at zero temperature, the real chemical potential is 
not a continuous function of time). 

For small aoeo, the two extrema are close, and the 
pulse shape closely matches that of the continuum so- 
lution [Fig. 4(a)]. For large aoeo, the upper boundary 
d ~ —(1 — e^""'") gives the better approximation, al- 
though any acceptable value of d gives a qualitatively 
correct solution [Fig. 4(c)]. 

This approximation involves two dimensionless param- 
eters: the primary parameter aoeo comes from specifying 
the energy scale eo , but there is also a secondary parame- 
ter coeo/7 which comes from specifying a unique "at rest" 



The approximation I used between Eqs. (6) and (7) gives a cor- 
rection to tile solution near the boundary even for the continuum 
case. 
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FIG. 4: Numerical results and analytical estimates for dis< 
binding energies (see text). Left side: The pulse shape n 
X — {E ~ /i(t))/eo. Right side: The boundary conditioii lui 
ht vs. X. (a) aoeo = 0.1, coeo/7 = —0.25. (b) Qoeo = 1, 
coeo/y = —0.025. (c) aoeo = 10, coeo/7 ~ -2. 
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FIG. 5: An initial perturbation given by a delta function (a) 
will spread as a Gaussian (b) until it encounters the boundary. 
As some time ~ the slope at the boundary will be maximal 




lo" io'° 10^" 10^^" 10^° 10=" 10^° 10™ 10^'^ lo"- 

N* 



coordinate system. The slope d — —{I — e "'''^°) is the 
Hmit for coeo/7 — > 0. 

A large value of ao^o corresponds to low mutation rate 
and strong selection. Given the constant propagation 
speed from Eq. (15), we can also consider the number 
of DNA molecules with a given energy Ei as a function 
of time: Ni{t) = Nfageoixifi - Cot/eo). For large aoeo 
(and Co small and negative), the general behavior of this 
function agrees with the argument in [4] [Fig. 4(c)]: Ni 
grows exponentially until Ni « N, i.e., almost the entire 
population has been replaced, and then remains almost 
constant until the population is replaced again. 



V. FINITE POPULATION SIZE 

As discussed in the introduction, the dependence of the 
propagation speed on the population size is of significant 
interest. In reference [11], Peng et al. estiirrate the effects 
of the population size by using a cutoff procedure [19] 
that ignores amplification for the part of the distribution 
where n{E) < -i-.^ While this is a reasonable approach 
and gives a result that matches simulations fairly well, it 
is not clear whether this is the correct cutoff [19]. 

A better approach is to base the cutoff procedure ex- 
plicitly on the key criterion: when will mean-field theory 
fail? Consider what happens to a single DNA molecule 
with a binding energy A less than the selection threshold 
fiQ, according to mean- field theory. The molecule corre- 
sponds to an initial perturbation Sno{E) — ■^S{E—[^o — 



FIG. 6; Deviation from mean-field propagation speed as a 
function of effective population size A''* (see text); theory re- 
sults and scaling collapse of simulations (small circles). 



A]), where 6{-) is Dirac's delta function, and the trans- 
formed perturbation Sfio{E) = ±e''°'-'^-i^°^S{E - [/io - 
A]) will evolve according to Eq. (8) with a fixed bound- 
ary /i(t) = /io- Simple scaling requires that the slope 

at the boundary, ^^^^^^ \ E=fig i reaches a maximum value 
~ ^e^ot'^-'^oV^^ after a time t'^'''^ - A2/7, as shown 
in Fig 5. According to Eq. (17), this corresponds to a 



population size perturbation of SN'^ 



max 
A 



e"»^/(aoA)' 



descendants (for aoA 3> 1). For A large enough, 
jjymax ^ jy^ i.e., the desccudauts of a single molecule 
will eventually replace the whole population, and the se- 
lection threshold must "jump" to keep the population 
size constant. This is a clear violation of mean field, 
and the condition (5iV™^'^ — N gives a lower boundary 
/i2(t) = n{t) — An beyond which mean field fails. The 
best estimate of the propagation speed that we can find 
from mean field is then the propagation speed Cf,^^^) of 
a pulse of length An = An - iA^''(cb(Ajv) ~ ^o), where 
6(Ajv) = tt/Atv; this includes a correction for the fact 
that the boundary in eq. (8) would actually move:^ 



cn ~ Co 



7Q;o7r 



(ln[iVln^(Ar)] -7r2/6)2 



(25) 



As shown in Fig. 6 ("Theory"), this estimate is very 
accurate. It slightly overestimates the correction, since 



A better version of this cutoff uses h{E) < j^, which does not 

depend on the energy scale/resolution. * Numerically, t^^'^ Ri A^/(67). 
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FIG. 7: Motion of boundary fj,{t) relative to estimate 
[Eq. (25)] for simulations with different population sizes. 

it ignores the contribution from the mean-field- violating 
events, but it is far better than the estimate in [11] ( "1 /N 
cutoff'). 

Fig. 7 shows the difference between the actual value of 
/i(t) and our estimate /Uq + cjv^ as a function of time for 
two simulations with different populations sizes. These 
plots support our discussion above — the difference re- 
mains almost constant for long periods, but then it sud- 
denly makes a large jump, i.e., there is an event that 
violates the mean- field assumption. Note that all jumps 
of size less than ~ — are included in the estimate Cm. 
The larger population size has a much smoother behav- 
ior than the smaller one, which is expected since the 
time required to propagate a random event scales as 

tmax- A2^~ln(iV)2. 



I , \ , \ , b I 

-2-10 1 

X 

FIG. 8: Effective boundary for discrete selection (see text). 

instance error-prone PGR, then there is no further 
correction — this corresponds to the continuum formula- 
tion. If, however, the mutations are introduced sepa- 
rately, after replication, then the population essentially 
evolves according to mean field between selections, and 
the second boundary behaves the same way as the first 
boundary, i.e., the effective cutoff is a distance S from 
the nominal value. The actual population by the second 
boundary is negligible, thus the effective population is 
simply N* = N'e°"'^. 

The high quality of the scaling collapse in Fig. 6 shows 
that the second boundary /i2(^) very accurately cap- 
tures the effects of a finite population on the propaga- 
tion speed: Figure 6 includes simulations for which the 
N* /N' corrections, which arc entirely due to the second 
boundary, vary from close to 1 to about 1600, and the 
collapse is accurate to better than a factor of 2 in the 
population size. Note that N* /N' may be larger than 
the amplification factor K, thus the effective population 
size can be larger than the largest population that occurs 
during the evolution, KN. 



A. Effective population size 

The argument above is based on the continuous time 
formulation of the evolution process. There are two types 
of corrections to this formulation: The selection is only 
done at discrete intervals, which applies also to mean 
field, and the mutation and replication may be done in 
various different ways, which changes the statistical prop- 
erties of the random noise. 

The selection process only affects the tail of the 
propagating pulse. If we solve the diffusion equation 

^•^g^'*'' = ^ 3^2'*^ numerically with the boundary con- 
dition /(x, t) = —X for X <C and set f{x, t) to zero for 
X > at integer values of t, the function at integer values 
of t will be as shown in Fig. 8. The effective boundary 
position is at X « 0.824, and ~ 1.21. This yields an 
effective population size (the population size correspond- 
ing to the dotted line in Fig. 8) N' k, N ^^"^ 2°iao<? ' '^^^^^ 
S = 0.824^ — this is accurate for N large enough that 
h{E) has almost constant slope at least a distance 6 from 
the boundary, i.e., for N ^ 

If replication and mutation are a single process, for 



VI. FINITE LENGTH 

The strongest assumption used to derive Eq. (8) was 
that the mutation rate p{e) is independent of the ini- 
tial binding energy, or, more generally, the genotype of 
the molecule that is mutated — I have yet to prove that 
this formulation can capture the behavior of finite sys- 
tems. While it is difficult to prove this in general, since 
the above assumption was precisely what allowed the use 
of an otherwise fully general mutation rate p{e), we can 
again consider the cases of high/low mutation rate and 
weak/strong selection. 



A. High mutation rate/weeik selection 

Consider the discrete, finite model described in sec- 
tion lie. As mentioned in section III A; for high muta- 
tion rate and weak selection, it is convenient to describe 
the mutation rate of an infinite system in terms of its 
diffusion coefficient D and drift speed v [Eq. (10)]. Using 
the same approach in a finite system, D and v will depend 



on the number of mismatches r [11], thus the constant 
speed solutions we have found wiU not be vaUd. How- 
ever, as long as the parameters vary slowly on the scale 
of the characteristic length — of the infinite model, we 
expect these solution to be good local approximations. 



1. Exponential approach to equilibrium 

In [11], Peng et al. argued that for small k, an initial 
population that is compact in mismatch space and far 
from equilibrium will form a moving pulse that exponen- 
tially approaches the equilibrium position; {t) — r^^ ^ 
e~'^^'^° , where ro{t) = ii{t)/tQ is the position of the 
boundary in mismatch space. If we numerically solve 
Eqs. (11-13) for ao, cq and 7 as a function of r, while 
keeping k constant, we indeed find that for a sizable range 
of k, Co depends almost exactly linearly on r, thus 



200 



droit) 
dt 



co(ro(t)) 



dco{r) 
dr 



[roit) 



(26) 



which leads to ro(i) — r^^ 



The exponential 



constant — agrees well with simulations (Fig. 10) 
and will in general differ from the estimate — found 
in [11]. wliic:li is the limit for k 0. 

Surprisingly, the linear dependence of Co(r) on r is 
most accurate not for very small fc; rather, there is a 
finite value of k at which co(r) and 7(r) are perfectly 
linear, while ao is constant and given by the solution to^ 



Simulation results 
Lower estimate r^^:^ 

Upper estimate min r+7t/p(r) 
Average estimate 
Original estimate by Peng et al. 
Corrected estimate for Peng et al. 




K = e'' = 1/(|) 



FIG. 9: Equilibrium position for the boundary as a function of 
amplification K; estimates and simulation results, as discussed 
in the text. L = 170, v = 0.01. 



However, the entire population has lower mismatch num- 
ber than ro, which means that the solution to Eq. (28) is 
a lower bound for r^*^ . 

Imposing the equilibrium condition = on a 
hounded solution yields 



k = 2D{r) - ^/AD{rf - v{rY cos(6). (29) 



1- (l-Fao)e~"° = — (1 -l-aosinhao - coshao). (27) 

This would give a perfectly exponential approach to equi- 
librium of a perfectly shape-preserved pulse at this single 
value of k (if we ignore corrections from finite time, dis- 
crete mismatch mimbers, etc.) the variation of 7 docs 
not affect either shape or movement once the pulse has 
reached its constant shape. This is also the value of k 
that gives the smallest ^^^7^, i.e., the slowest conver- 
gence. Simulation results verify that the exponential ap- 
proach is near perfect for this k, and better than for either 
weaker or stronger selection (not shown). 



By solving the above equation for &, we find a population 
that lies between r and 7-+ i.e., the entire population 
has higher mismatch number than the position at which 
the equilibrium parameters occur. Its upper bound r + 
is thus an upper bound for r^*^, and to find the best 
upper bound we minimize over r. Our estimate for the 
equilibrium value of tq is then the average of the lower 
and upper bound: 



^ + J min r+ (30) 

2r>reo=o &c6=o(r-) 



2. Equilibrium position 

The simplest estimate we can make for the equilibrium 

value r^^ is the position at which the values of D and v 
are such that the propagation speed cq would be [from 
Eqs. (11-13)]: 

k = 2D{r) - ^/4D{r)^ ~v{r)^. (28) 



As shown in Fig. 9, this estimate is very accurate for most 
values of = K = e'^; for K > 1.02, the discrepancy is 
always less than 1.^" The lower bound given by Eq. (28) 
is equal to the "second region formula" in [20]. In Fig. 9, 
I also include the estimate found in [11] for comparison, 

as well as a "corrected" version in which - — — has been 

T 

replaced by fc = In i (as in section HI A); this second 
version is very close to the lower bound for small k. 



9 Solve the r-derivativcs of Eqs. (11-13) using the conditions ^ = 10 p^j. ^^j-y ^i^^^ ^o 1, ao is small, thus the assumption that D{r) 
— ^ ^Zi ) §^ = — i^o and §f = 0. and v{r) vary slowly on the scale of ^ does not hold. 
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B. Low mutation rate/strong selection 

Reference [4] describes simulations of in vitro molec- 
ular evolution based on a realistic model of the binding 
energy of a DNA sequence and the Mnt protein and found 
that the evolution rate, in terms of number of iterations 
required to reduce the average number of mismatches by 
one, was very close to constant in the mean-field limit. 
According to Eq. (15), the propagation speed should de- 
pend logarithmically on the beneficial mutation rate p_ , 
in which the number of remaining mismatches m is a fac- 
tor. However, this equation describes the rate of evolu- 
tion of binding energy, not of the number of mismatches, 
and it only holds when all beneficial mutations confer the 
same decrease in binding energy. 

It is easy to show that decreasing the rate/benefit 
of beneficial mutations and increasing the rate/penalty 
of deleterious mutations will always decrease the en- 
ergy evolution rate: — co(p2) < —co{pi) whenever 
J^^ep2{e)de > J^^epi{e)de VE. This does not hold 
for the mismatch evolution rate 

^.;^e---(^)|-;^e-«-(^)|, (31) 

where M± are the sets of mutations that would correct a 
mismatch (— ) / introduce a mismatch (+), and ^ (= g 
in [4]) is the rate of each specific mutation. If there is a 
single mismatch that is associated with a much larger en- 
ergy penalty than any other, then mutants that have cor- 
rected this mismatch will outcompete all other mutants, 
even ones that have corrected multiple other mismatches. 
This may lead to a lower initial rate of mismatch correc- 
tion than if there was no such extreme mismatch, or if it 
could not be corrected. 

For realistic models such as the one in [4], the vari- 
ous mismatches are associated with a wide range of en- 
ergy penalties. If the mutation rates are similar for the 
different mismatches, then the mismatch evolution rate 
may indeed be very close to constant during most of the 
evolution. For the intial population used in [4], which 
has 6 mismatches, the effective number of mismatches 
(the number of mismatches that would give the observed 
mismatch evolution rate if they all had the same energy 
penalty) is less than 2.5 for a mutation rate of 10^'' (the 
lower the mutation rate, the lower the effective number 
of mismatches). 

VII. FINITE TEMPERATURE 

So far we have used perfect selection, i.e., we always 
keep the best N protein binders. This is a good approxi- 
mation if ^ 1, or, for discrete models, (Seq ^ 1. The 
more realistic situation of finite temperature is far more 
difficult to model, as we can no longer take the contin- 
uum limit in the formal manner described in appendix A. 
There are, however, two relatively simple limits. 



For sufficiently large populations, or in mean field, 
the population distribution will extend to energies much 
lower than the chemical potential, where the selection 
factor is practically equal to one. In this re- 

gion, the behavior is still governed by Eq. (8). We can 
consider the region around ii(t) to impose a non-trivial, 
but fixed, boundary condition, and the evolution rate 
will still be cq for mean field (cjv* for finite populations, 
where the first effective population size correction N' /N 
depends on the temperature, but not on A'^). 

On the other hand, for finite populations and very 
strong selection, all of the population will have binding 
energy much higher than fj.{t). We can then approximate 

P(E) = 1—^ « e'^^'^W-^). (32) 

Here, the relative fitness of two molecules is constant; 
P{Ei)/P{E2) = e'^^^^-Si). thus, this limit is equivalent 
to the traditional fixed-fitness models of evolution! The 
exponential dependence on E again allows a formal con- 
tinuum limit of the mean-field theory (see appendix Al). 
While this mean-field theory is not a good approximation 
for any finite population size [21], one can apply the same 
criterion as in section V, SN"^^^ = N, to find boimded- 
from-below solutions that capture the behavior of finite 
populations^^. This is an alternative approach to the 
statistical criterion used in [10]. 

We can easily find the propagation speed of a single 
molecule (for the discrete evolution process): The selec- 
tion probability is exponential in the energy, and this 
simply alters the effective mutation rate (mutating with 
rate r and then selecting the mutants with probability p 
times the probability of selecting the original is equiva- 
lent to mutation rate rp), giving the propagation speed 

/oo 
ee-^''p{e)de. (33) 
-oo 

The propagation speed for a single molecule has the same 
form as the propagation speed cq in the limit of large pop- 
ulations, when the approximation in Eq. (32) no longer 
holds. This immediately shows that for /3 > ao, the ap- 
proximation makes no sense. 

VIII. DISCUSSION 

In the preceding sections, we have seen how the as- 
sumption of a translationally invariant mutation rate 
leads to a simple set of equations describing the mean- 
field theory of competitive evolution, and that this pow- 
erful formulation can give very accurate predictions in 
many different situations. The effective dynamics of the 



The population size perturbation must here be calculated in the 
presence of the boundary. 
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evolution process are to a good approximation captured 
by the rescaled mutation rate p{e) = e^""'^p{e). Far from 
equilibrium, or for small populations, this is the rate at 
which mutations are fixed in the population, while for 
a sufficiently large population close to equilibrium, this 
gives the relative occurrences of different bases; For an 
additive energy model, the equilibrium Piib^) / Pi{h\) be- 
tween two bases hi , 62 at position i that have an energy 
contribution difference of 5Ei = £^^(62) — Ei{bi) will be 
shifted by a factor e"^"""^^* relative to the equilibrium 
Pi (^2)/Pi (&i) for pure mutation, without any selection. 

One significant result was the accurate prediction of 
the limits of mean-field theory for a finite population and 
the way in which it fails. Namely, when a molecule by 
chance gets sufficiently far ahead of the general popula- 
tion, the population is quickly replaced by the descen- 
dants of that single molecule, evidenced by a jump in 
the selection threshold. This result has important conse- 
quences for the maximum genetic variation that can build 
up during competitive evolution: Whenever such a jump 
occurs, essentially all prior genetic variation is wiped out. 
For instance, for an organism that reproduces mostly by 
asexual growth^^ but with periodic sexual reproduction 
stages, this sets a lower limit for the frequency of sexual 
reproduction necessary to maintain genetic diversity. 

Interestingly, the parameter values of the experiment 
performed by Dubertret et al. [13] place it in the high- 
temperature regime described in section VII, which has 
the same dynamics as the traditional fixed-fitness models 
of evolution: Even though the population is fairly large 
(pa 10^), the very large amplification (2^^) means that 
during selection, even the strongest protein binders may 
be removed. Also, the system used is only slightly larger 
than the propagating pulse, thus finite size effects will 
likely be severe (the pulse length is far more important 
in the high- temperature regime than in the T = limit). 

While the global translational invariance of the model 
presented here is a very strong assumption, most of the 
results will be locally accurate for any system for which 
the mutation rates p{e) change slowly on the character- 
istic length scale I/qq- As shown in section VI, one can 
fairly easily estimate corrections when the energy depen- 
dence of the mutation rate has a known, simple form, 
yielding even more accurate results. 

In summary, I have presented a simple model that ac- 
curately captures the key qualitative and quantitative 
features of asexual competitive evolution on a smooth 
landscape. This allows prediction of the performance of 
competitive evolution for a wide range of parameters for 
any given, reasonably smooth energy landscape. Addi- 
tionally, this forms a baseline against which the perfor- 
mance of more complex evolutionary procedures, e.g. us- 
ing recombination [12], can be compared. 
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APPENDIX A: CONTINUOUS TIME 
APPROXIMATION 

As before, let nt{E) be the normalized population den- 
sity at time t. Formally, an evolution cycle can be repre- 
sented by the operators 11 = K (reproduction), M (mu- 
tation) and S (selection): 

nt+i = SMRnt. (Al) 

Both M and R can be written as the exponentials of 
other operators^^ [R = e"", where r = fc = ln{K), and 
M = e™], while S = S^. This allows us to generalize 
Eq. (Al) to the approximate equation 

nt+ At = Se^*'"e^*'-nt, (A2) 

which in the limit At becomes 

dtut = (m + r)nt , nt[fi{t)] = (A3) 

— since selection is done via a step function, the contin- 
uum limit is the boundary condition nt[fi{t)] = 0, where 
fi(t) is dynamically chosen such that nt remains normal- 
ized, m and r commute, thus the only sources of error 
in this approximation are the commutators with the se- 
lection operator/boundary condition, and these are lo- 
calized around the bondary. By rescaling time we can 
identify classes of models that are equivalent within this 
approximation: 



ti = t/l (A4) 

R; = e'"' = A:' (A5) 

Ml = e'"" (A6) 

Si = S. (A7) 



The continuum model [Eq. (A3)] is a scaling limit of 
this class, and all models within the class that are suffi- 
ciently close to this limit (i.e., R and M are close to 1) 
should be well approximated by it. Figure 10 shows sim- 
ulations of various models from the same scaling class; 
M is parametrized by j/q, which is rescaled according to 

^o,; = ^[l-(l- A^o)']. ^ 

The continuum approximation fails if rescaling by a 
factor of 2 is no longer accurate, i.e., if SM2AtR2Af'^t is 
significantly different from SMAtRAtSMAtRAt^it- The 
difference is that molecules that in one case are removed 



Where the asexual growth can be described by competitive evo- 
lution. 



Even in situations where M can not be written exactly as an 
exponential, this is a good approximation as long as the mutation 
rate per base per iteration is small. 
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FIG. 10: Simulations of different models in the same scaling 
class, plotted on appropriately scaled time axes. The scaling 
factor / is in parantheses. For small I, the approach to equi- 
librium is very well fit by an exponential with the constant 
— ^-§^, as expected (section VIAl). 



after At in the other case will survive until time 2 At, but 
will then face a stronger selection. These molecules can 
only make a difference if they reach the leading (low E) 
edge of the population. As they start out at the tail of the 
population at time At, there are only two ways in which 
they can reasonably reach the leading edge of the popu- 
lation: Either the propagating pulse is very short, which 
happens only in finite populations, or the mutations are 
strong enough to almost equilibrate the distribution, i.e., 
vq is large (L is finite). In Fig. 10, the simulations devi- 
ate significantly from the continuum limit only for vq on 
the order of 1. 



1. Finite temperature 

As noted above, the discrete evolution process can be 
represented using operators: 



nt+i = SMRnt = Se'"e""nt. 



(A8) 



For large temperatures, very strong selection and finite 
populations^'', Eq. (32) allows us to write the selection 
operator as S = Ne^^^, where the normalization oper- 
ator N simply rescales a function; N f = — ^ , and 

^ J f{E)dE' 

E is the binding energy operator; E/(i?) = Ef{E). The 
differential mutation operator m can be explicitly writ- 



/oo 
p(e)A,de, 
-OO 



(A9) 



with the operators given by [A.^nt\{E) = nt{E — e) ■ 
nt{E). Using the commutators 



[E,A,] = eA, 

[s, r] = [m, r] = 



(AlO) 
(All) 



we can now collect the operators in one exponential: 
SMR = Ne-'^=^e'"e"- = Ne-'3Ee/p('^^''''e'= (A12) 



= Nexp 



1 - e-P' 



p{e)A,de - /3E 



which immediately gives the continuum equation 

/•OO 

dtnt{E) = / p{e)nt{E^e)de+l3[fi{t)^E]nt{E) (A13) 



for fl{t) such that rit remains normalized, where p{e) = 
Note that we have collected several E- 
independent factors in ij.{t), including R = K and factors 
from the commutators. The discrete evolution equation 
only specifies that nt must be normalized for integer t, 
but we can choose fl(t) such that nt is normalized for all 
t. 

As before, there are classes of models that are equiva- 
lent. Note that rescaling time now also rescales temper- 
ature, and the mutation rate is rescaled in a nontrivial 
manner^^. Given Eq. (32), the mean-field behaviors of 
"equivalent" models are indeed identical, and the only 
effective difference between equivalent models is how of- 
ten we restrict the population size to N . However, since 
the selection is highly competitive, i.e., we remove some 
DNA that is as good as or better than what remains, 
that difference is significant. 

While the mean-field theory has a formal continuum 
limit, the evolution process itself is inherently discrete 
in this regime: Toward the continuum limit, frequent, 
strong selection will cause a blow-up of the statistical 
noise from the selection process. This is in contrast to the 
traditional evolution models, for which the evolution can 
be described by a continuum process from the start [9, 
21, 22]. 



^'^ Indeed: this is a moan-field theory that is only applicable to finite 
populations! 



Two models are equivalent if and only if pi //3i = P2/f^2. 
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